@Essoz Essoz commented Jan 6, 2026

PR Description: Low-Overhead Instrumentation Support

Summary

This PR introduces dynamic instrumentation policies (sampling and warm-up) to significantly reduce overhead for large-scale training runs. It refactors the instrumentation control logic to be injected directly into training and evaluation loops, ensuring robust state management and correct application of policies across different execution stages.

Key Changes

1. Loop-Based Instrumentation Control

  • New Module: Introduced traincheck.instrumentor.control with start_step() and start_eval_step() functions.
    • start_step(): Increments the global training step counter and applies the configured policy (interval/warmup) to toggle instrumentation.
    • start_eval_step(): Manages a separate eval_step counter for evaluation loops, reusing the global policy.
  • Decoupling: Moved policy enforcement logic out of the optimizer.step() wrapper. This prevents issues where instrumentation state could become desynchronized or incorrectly applied outside of loop contexts.
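
The interaction between the step counters and the interval/warm-up policy can be sketched as follows. This is a minimal illustration, not the code in traincheck.instrumentor.control: the `PolicyState` container, its field names, the 1-based step counting, and the modulo rule are all assumptions made for the example, and `disable_wrapper` merely stands in for the flag that toggles instrumentation.

```python
from dataclasses import dataclass


@dataclass
class PolicyState:
    """Hypothetical container; the real module may keep this state elsewhere."""
    sampling_interval: int = 1     # instrument every Nth step (assumed >= 1)
    warm_up_steps: int = 0         # always instrument the first K steps
    step: int = 0                  # training-step counter
    eval_step: int = 0             # separate evaluation-step counter
    disable_wrapper: bool = False  # True means instrumentation is switched off


def _should_instrument(step: int, state: PolicyState) -> bool:
    # Steps are counted from 1 in this sketch; warm-up steps are always traced.
    if step <= state.warm_up_steps:
        return True
    return step % state.sampling_interval == 0


def start_step(state: PolicyState) -> None:
    """Advance the training counter and apply the policy to the new step."""
    state.step += 1
    state.disable_wrapper = not _should_instrument(state.step, state)


def start_eval_step(state: PolicyState) -> None:
    """Advance the evaluation counter, reusing the same policy."""
    state.eval_step += 1
    state.disable_wrapper = not _should_instrument(state.eval_step, state)


if __name__ == "__main__":
    state = PolicyState(sampling_interval=4, warm_up_steps=2)
    traced = []
    for _ in range(8):
        start_step(state)
        if not state.disable_wrapper:
            traced.append(state.step)
    print(traced)  # [1, 2, 4, 8]
```

Under these assumptions, steps 1 and 2 are traced because of warm-up, and steps 4 and 8 because of the interval; everything else is skipped.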

2. Smart AST Injection (Source Instrumentation)

  • Enhanced Visitor: Updated InsertTracerVisitor in traincheck/instrumentor/source_file.py to detect loop contexts:
    • Training Loops: Identified by calls to optimizer.step() or loss.backward(). The visitor injects start_step().
    • Evaluation Loops: Identified by context (e.g., inside functions named test, eval, valid). The visitor injects start_eval_step().
  • Automatic Injection: The appropriate control function is automatically injected at the start of the loop body.
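
The training-loop case can be illustrated with Python's standard `ast` module. The sketch below is not InsertTracerVisitor itself: `LoopControlInjector`, `_calls_in_loop`, `_is_training_call`, and `inject` are hypothetical names, the matching rule (a receiver literally named `optimizer`, or any `.backward()` call) is a simplification, and evaluation-loop detection and the import of `start_step` are left out. It requires Python 3.9+ for `ast.unparse`.

```python
import ast


def _calls_in_loop(loop):
    """Yield Call nodes in a loop body without descending into nested loops."""
    stack = list(loop.body)
    while stack:
        node = stack.pop()
        if isinstance(node, (ast.For, ast.While)):
            continue  # nested loops are judged on their own
        if isinstance(node, ast.Call):
            yield node
        stack.extend(ast.iter_child_nodes(node))


def _is_training_call(call):
    """Simplified rule: matches optimizer.step() or <anything>.backward()."""
    func = call.func
    if not isinstance(func, ast.Attribute):
        return False
    if func.attr == "backward":
        return True
    return (
        func.attr == "step"
        and isinstance(func.value, ast.Name)
        and func.value.id == "optimizer"
    )


class LoopControlInjector(ast.NodeTransformer):
    """Hypothetical visitor: prepends start_step() to training-loop bodies."""

    def visit_For(self, node):
        self.generic_visit(node)  # handle inner loops first
        if any(_is_training_call(c) for c in _calls_in_loop(node)):
            control = ast.Expr(
                value=ast.Call(
                    func=ast.Name(id="start_step", ctx=ast.Load()),
                    args=[],
                    keywords=[],
                )
            )
            node.body.insert(0, control)
        return node

    visit_While = visit_For


def inject(source: str) -> str:
    """Return the source with start_step() injected into training loops."""
    tree = LoopControlInjector().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Because nested loops are skipped when scanning a loop body, an outer epoch loop that only wraps an inner batch loop is left alone, and the call lands in the innermost loop that actually contains `optimizer.step()` or `loss.backward()`. Whether the real visitor makes the same choice is not stated in this PR.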

3. CLI & Configuration Updates

  • traincheck-collect Arguments:
    • Added --sampling-interval: Controls how frequently steps are instrumented (e.g., every Nth step).
    • Added --warm-up-steps: Specifies the number of initial steps to always instrument, regardless of the sampling interval.
  • Dynamic Policy: Removed static schedule generation; policies are now evaluated dynamically at runtime, allowing for more flexibility.
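
A minimal `argparse` sketch of the two new options is shown below. Only the flag names come from this PR; the `build_parser` helper, the integer types, the defaults (instrument every step, no warm-up), and the help strings are assumptions for illustration, and the real traincheck-collect parser has other arguments that are omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the two options added by this PR."""
    parser = argparse.ArgumentParser(prog="traincheck-collect")
    parser.add_argument(
        "--sampling-interval",
        type=int,
        default=1,  # assumed default: instrument every step
        help="Instrument every Nth step.",
    )
    parser.add_argument(
        "--warm-up-steps",
        type=int,
        default=0,  # assumed default: no warm-up window
        help="Number of initial steps to always instrument.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--sampling-interval", "10", "--warm-up-steps", "5"]
    )
    # argparse maps dashes to underscores in attribute names.
    print(args.sampling_interval, args.warm_up_steps)  # 10 5
```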

4. Robustness Improvements

  • Stage Transitions: Updated annotate_stage to reset DISABLE_WRAPPER to False upon entering a new stage. This ensures instrumentation is re-enabled by default when switching contexts (e.g., from Training to Validation), preventing state leakage.
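
The reset on stage entry can be pictured with the sketch below. The `config` namespace, the `CURRENT_STAGE` field, and the single-argument signature are assumptions; only the names annotate_stage and DISABLE_WRAPPER, and the reset-to-False behavior, come from this PR.

```python
from types import SimpleNamespace

# Hypothetical stand-in for wherever the real flag lives.
config = SimpleNamespace(DISABLE_WRAPPER=False, CURRENT_STAGE=None)


def annotate_stage(stage_name: str) -> None:
    """Record the new stage and re-enable instrumentation by default."""
    config.CURRENT_STAGE = stage_name
    # A skipped training step may have left the flag set to True;
    # clearing it here keeps that state from leaking into the next stage.
    config.DISABLE_WRAPPER = False
```

With this shape, a training loop that ends on a skipped step cannot silently disable tracing for the validation stage that follows; the first control call in the new stage's loop then decides again per step.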

Verification

  • Unit Tests:
    • Added tests/test_loop_injection.py to verify that AST transformations correctly identify loop types and inject the appropriate control calls.
    • Updated tests/test_dynamic_policy.py to verify the runtime logic of start_step and policy application.
    • Ran tests/test_policy_injection.py to verify CLI argument integration.
  • End-to-End Testing:
    • Verified with the mnist.py example. Trace logs show the expected "Interval step" (instrumented) and "Skipping step" (skipped) messages for both the training and testing loops.

@Essoz Essoz merged commit cdca52e into main Feb 12, 2026
0 of 3 checks passed